Main goals of the session

  1. Understand the difference between differences and substitutions
  2. Estimate corrected evolutionary distances
  3. Select the best-fit nucleotide substitution model for a set of aligned sequences

1. Practicals organization

The main aim of this session is to understand the importance of “substitution models” in molecular evolution. In particular, exercises will focus on the important distinction between the number of differences observed when comparing nucleotide sequences and the actual number of substitutions that have occurred in their divergence. First, you will work with the Jukes and Cantor model and the famous distance correction based on this model. You will also simulate nucleotide sequences under a more complex substitution model and use “Model Test” to select the best-fitting model underlying the data.

In this lesson we will assume that

  1. The length of the sequences studied is finite (i.e. multiple mutations can occur at the same position, some of them even back to an ancestral nucleotide).

  2. The sequences studied have evolved independently since their divergence from a common ancestor (i.e. the substitutions that have occurred in each descendant lineage are independent).

  3. the mutation rate is constant over time and the same for all lineages and positions.


Throughout the document you will see different icons whose meaning is:

: Additional or useful information

: Practical exercise

: Hint to solve an exercise or to do a task

: Slot to answer a question

: Problem or task to be solved


2. Installing R packages

You can use either the R console in the terminal or RStudio for this exercise. If you don’t have R installed, you can download the appropriate package for your system here. To install RStudio, go to this page and follow the instructions.

Before starting the exercises, you will need to install the phangorn library. This package includes a function for comparing different nucleotide substitution models.

Open the R console in the terminal (or in RStudio) and type:

  install.packages("phangorn") # Do not run if you have installed phangorn in practice 4.

The most popular and flexible software for performing model selection is ModelTest (Darriba et al. 2019). The latest version of this program can be downloaded here. However, for practical reasons, we decided not to use this software in these exercises, as it requires compilation and depends on some third-party dependencies.


3. Example 1

Given that DNA sequences can only present four different states (A, T, C, G), the number of observed and actual substitutions can be different when comparing sequences of finite length (the more divergent the sequences the greater this difference will be).

  1. Let’s see an example of this problem. Read carefully the explanation of the different steps we go through for the completion of the simulations and reproduce the process in the R terminal (or RSsudio):

    • HOW MANY SUBSTITUTIONS actually happened in 2 sequences that diverged G generations ago from a common ancestor (ancestral sequence)? Let’s consider a substitution rate (r) is \(10^-6\) (substitutions per site per generation). Note that we do not need to know the sequence of these two descendants to calculate the expecter number of substitutions (Figure 1), which only depends on the mutation rate and the time since the split. The expected number of substitutions in each lineage after G generations can be obtained using the Poisson distribution (which describe the distribution of rare events in a large population):

      subst.rate <- 1e-6
      G <- 1e6
      L <- 100
      n.subst.lineage.1 <- rpois(n=1,lambda=subst.rate*L*G)
      n.subst.lineage.1
      n.subst.lineage.2 <- rpois(n=1,lambda=subst.rate*L*G)
      n.subst.lineage.2
    • Then, the actual number of substitutions between the two sequences should be the sum of those occurred in both lineages since the split:

      total.subst <- n.subst.lineage.1 + n.subst.lineage.2
      total.subst

Figure 1. Expected number of substitutions in two lineages after their split from a common ancestor. The actual number of substitutions in each lineage will be a random sampling from a Poisson distribution with lambda = r x T


Questions

1. Which would be the actual evolutionary divergence (K= number of substitutions per site) between the two descendant sequences in your simulated replicate?

Answer:

2. If you could compare the sequences after time G, would you see all these substitutions? Why?

Answer:

3. Should the estimate in Question 1 be corrected, and if so, which substitution model would you use?

Answer:


4. Example 2

In this section you will simulate the evolution of two DNA sequences under the same scenario as in Example 1 and the Jukes and Cantor model. In this case, you start with an ancestral sequence (a simulated sequence; 100 bp), estimate the number of substitutions in the two lineages, randomly choose a position in the sequence to introduce the substitution, and replace the existing nucleotide with one of the other three (also randomly).

  1. Enter the R console and type the following commands:

    • generating the ancestral sequence (100 nucleotides):

       Tnt <- c("A","C","G","T")
       L <- 100
       seq <- sample(x=Tnt, size=L, replace=TRUE, prob=c(0.25,0.25,0.25,0.25))
       seq
    • set the Jukes and Cantor (JC) Rate matrix Q:

      Q <- matrix(1/3, ncol=4, nrow=4, dimnames=list(c("A","C","G","T"), c("A","C","G","T")))
       for(x in 1:4) {Q[x,x] <- 0; Q[x,x] <- -sum(Q[x,])}
       Q
  2. Let’s write an R function defined as: substitutions.on.sequence () to randomly choose the specific nucleotide change (e.g. A->G or A->C, etc..) for each substitution and the site among the L positions in the ancestral sequence seq where the change occurred (=to generate the two descendant sequences). The parameters of this function will be: the ancestral sequence, the JC rate matrix and the four possible states (nucleotides):

    substitutions.on.sequence <- function(seq,substitutions,Q,Tnt) {

    seq is a vector of symbols with length L in which each position can only take four possible states (e.g.., A, T, A, G, C …etc.), substitutions is the total number of expected substitutions, Q is the instantaneous substitution rate matrix under a particular model (here the JC) and Tnt is a vector of C values containing the different symbols accepted in the sequence (the four nucleotides).

    • The script continues as follows (NOTE that you defined seqin the example 2):

           seq1 <- seq
           len <- length(seq)

    • now, we select the sites where the substitutions will fall on (here we use a uniform probability, that is, all sites are equally likely to mutate):

           position.substitution <- floor(runif(n=substitutions ,min=1, max=len))

    • then, we check the nucleotide that is in the position that will mutate(nti = the position in the matrix that corresponds to the nucleotide that will change, e.g. A = 1) and therefore, which nucleotides can replace it (’rest` = the positions in the matrix that correspond to the three possible changes, e.g. C=2, T=3 and G=4):

           number.symbols <- c(1:length(Tnt))
           for(ps in position.substitution) {
                nti<-which(Tnt==seq1[ps])
                rest.symbols<-number.symbols[-nti]

    • In the transition matrix Q, the sum of instantaneous rates for each of the three possible changes (SQ) is equal to 1 (remember that in Q, we do not consider the diagonal for this sum). Consider cQ as the cumulative probability of each of these changes (this is a trick to sample from an uniform distribution with probabilities 0.3333; see the loop below):

                SQ<- 1
                cQ<- 0
                ntj<- 1

    • based on this probabilities, which is the new nucleotide (the ntj position in the matrix) after the substitution?:

                ranp <- runif(1)
                while(ntj > 0) {

    • calculate the (cumulative) probability of a substitution from nucleotide nti to ntj:

                     cQ<-cQ+Q[nti,rest.symbols[ntj]]/SQ
                     if(ranp < cQ) {
                          seq1[ps] <- Tnt[rest.symbols[ntj]]
                          ntj <- -1
                     }
                ntj <- ntj+1
                }
           }
           seq1

      }


Here you can find the complete R code for the function:

substitutions.on.sequence<-function(seq,substitutions,Q,Tnt) {
seq1 <- seq
len <- length(seq)
position.substitution <- floor(runif(n=substitutions,min=1,max=len))
number.symbols <- c(1:length(Tnt))
for(ps in position.substitution) {
   nti <- which(Tnt==seq1[ps])
   rest.symbols <- number.symbols[-nti]
    SQ <- sum(Q[nti,-nti])
    cQ <- 0
    ntj <- 1
    ranp <- runif(1)
    while(ntj > 0) {
      cQ <- cQ+Q[nti,rest.symbols[ntj]]/SQ
      if(ranp < cQ) {
        seq1[ps] <- Tnt[rest.symbols[ntj]]
        ntj <- -1
        }
      ntj <- ntj+1
    }
  }
seq1
}

  1. Now, we can generate the two descendant sequences using the function:

    seq.ancestral <- seq
    seq.lineage.1 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.1,Q,Tnt)
    seq.lineage.2 <- substitutions.on.sequence(seq.ancestral,n.subst.lineage.2,Q,Tnt)
  2. Now you can compare the three sequences:

    seq.ancestral
    seq.lineage.1
    seq.lineage.2
  3. Finally, we can build a matrix with the evolutionary distances (observed differences) and actual substitutions between the descendant sequences and between these sequences and the ancestral sequence:

    Diff <- matrix(0, nrow=3, ncol=2, dimnames=list(c("1 vs.2","anc vs.1","anc vs.2"),c("OBS","ACTUAL")))
    Diff[1,1] <- sum(seq.lineage.1 != seq.lineage.2)
    Diff[2,1] <- sum(seq.ancestral != seq.lineage.1)
    Diff[3,1] <- sum(seq.ancestral != seq.lineage.2)
    Diff[1,2] <- n.subst.lineage.1 + n.subst.lineage.2
    Diff[2,2] <- n.subst.lineage.1
    Diff[3,2] <- n.subst.lineage.2
    Diff

5. Example 3

In the Jukes and Cantor model, the relationship between observed (p) and actual (K) number of nucleotide substitutions per site is given by the following equation:

K = - \(\frac{3}{4}\) ln (1 – \(\frac{4}{3}\) p)


Using this formula, we can calculate the theoretical curve of the relationship between these two quantities:


Questions

4. Why don’t the simulated values fit perfectly with those expected from the theory (theoretical curves)?

Answer:

5. Based on the plot, which is the main consequence of not correcting divergences for multiple hits when calculating evolutionary rates?

Answer:

6. Looking at the black curve, do you think there is any limit to the Jukes and Cantor correction? Could this correction be applied in all cases, regardless of divergence times?

Answer:


6. Example 4

Let’s now work with a more complex model, the HKY85 model Hasegawa, Kishino and Yano 1985, which allows unequal base frequencies and distinguishes between transition and transversion rates.

Exercise

Simulate an ancestral DNA sequence (ex4.ancestral.seq) and two descendant sequences (ex4.seq.lineage.1 and ex4.seq.lineage.2) as in the example 2 (step 1) but using the nucleotide frequencies in the example 4. Then, use function substitutions.on.sequence with the Q.HKY matrix to generate two descendant sequences after G generations.

Generate a file (simulated.fasta) with the three sequences in FASTA format (the ancestral and the two descendant sequences). To do that, you can use the function writeLines() with sep="", to print the clean sequence in the screen. Copy and paste the sequences in the file in a correct fasta format.



Questions

7. Is the HKY the best model?

Answer:

9. Do you think it is possible that Model Test estimate a model different from the HKY despite having simulated our sequences under this model? Why?

Answer:

10. Are the parameters of the model (probabilities and frequencies) correctly estimated by Model Test (=are the values similar to those of matrix Q.HKY)?

Answer:


Deliver info

Deliver this document in AULAESCI with your answers

Deadline: June 28, 2024 - 23:59

Doubts?